The famous Brandenburger Tor in Berlin
Airbnb is among the most frequently used platforms to book short-term rentals all over the world. In this analysis, we put ourselves in the shoes of a tech-savvy couple that is planning a trip to Berlin and wants to book an apartment via Airbnb. Having access to city-specific Airbnb data, the goal of the analysis is to find a regression model that predicts the price this couple would have to pay for a 4-night stay at an Airbnb apartment.
There are three main steps to this analysis: (i) data exploration and feature selection, (ii) model selection and validation, and (iii) a brief summary of findings and recommendations.
First, we import the relevant libraries and define some of the basic settings for the analysis.
Next, we load the relevant data from insideairbnb.com. We cache this data so that it does not download every time that the document is knitted.
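The caching idea can be sketched as follows. This is a minimal illustration rather than the code used in this analysis: `load_cached()`, its `fetch` argument, and the stand-in download function are hypothetical names, and knitr's chunk option `cache = TRUE` is another way to get a similar effect.

```r
# Hypothetical helper: run `fetch()` only when no cached copy exists yet.
load_cached <- function(cache_path, fetch) {
  if (!file.exists(cache_path)) {
    saveRDS(fetch(), cache_path)  # first run: fetch and store on disk
  }
  readRDS(cache_path)             # every run: read the stored copy
}

# Stand-in for the real download; it counts how often it is called.
n_calls <- 0
fake_download <- function() {
  n_calls <<- n_calls + 1
  data.frame(id = c(2015, 3176), price = c("$77.00", "$90.00"))
}

cache_file <- tempfile(fileext = ".rds")
listings_first  <- load_cached(cache_file, fake_download)
listings_second <- load_cached(cache_file, fake_download)
```

In this sketch the second call reads the stored file instead of calling the download function again.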
Now that the data is loaded, it helps to get a feel for the different variables. This part of the analysis is known as Exploratory Data Analysis. There are three substeps to this:
This tells us that we are looking at more than 18k Airbnb rentals in Berlin, for which we have 74 variables. “Glimpse” also tells us that the variables are in all kinds of formats and likely require some manipulation for the actual analysis. For instance, “host_acceptance_rate” is stored as a character string (e.g., “91%”) rather than as a number.
Rows: 18,288
Columns: 74
$ id <dbl> 2015, 3176, 7071, 9991, 1~
$ listing_url <chr> "https://www.airbnb.com/r~
$ scrape_id <dbl> 2.021092e+13, 2.021092e+1~
$ last_scraped <date> 2021-09-22, 2021-09-22, ~
$ name <chr> "Berlin-Mitte Value! Quie~
$ description <chr> "Great location! <br />3~
$ neighborhood_overview <chr> "It is located in the for~
$ picture_url <chr> "https://a0.muscache.com/~
$ host_id <dbl> 2217, 3718, 17391, 33852,~
$ host_url <chr> "https://www.airbnb.com/u~
$ host_name <chr> "Ion", "Britta", "BrightR~
$ host_since <date> 2008-08-18, 2008-10-19, ~
$ host_location <chr> "Key Biscayne, Florida, U~
$ host_about <chr> "Isn’t sharing economy gr~
$ host_response_time <chr> "within an hour", "a few ~
$ host_response_rate <chr> "100%", "40%", "100%", "N~
$ host_acceptance_rate <chr> "91%", "100%", "N/A", "0%~
$ host_is_superhost <lgl> TRUE, FALSE, TRUE, FALSE,~
$ host_thumbnail_url <chr> "https://a0.muscache.com/~
$ host_picture_url <chr> "https://a0.muscache.com/~
$ host_neighbourhood <chr> "Mitte", "Prenzlauer Berg~
$ host_listings_count <dbl> 5, 1, 2, 1, 4, 4, 2, 1, 4~
$ host_total_listings_count <dbl> 5, 1, 2, 1, 4, 4, 2, 1, 4~
$ host_verifications <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified <lgl> FALSE, TRUE, TRUE, TRUE, ~
$ neighbourhood <chr> "Berlin, Germany", "Berli~
$ neighbourhood_cleansed <chr> "Brunnenstr. Süd", "Prenz~
$ neighbourhood_group_cleansed <chr> "Mitte", "Pankow", "Panko~
$ latitude <dbl> 52.53305, 52.53471, 52.54~
$ longitude <dbl> 13.40394, 13.41810, 13.41~
$ property_type <chr> "Entire guesthouse", "Ent~
$ room_type <chr> "Entire home/apt", "Entir~
$ accommodates <dbl> 2, 4, 2, 7, 1, 5, 2, 4, 4~
$ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text <chr> "1 bath", "1 bath", "1 sh~
$ bedrooms <dbl> 1, 1, 1, 4, NA, 1, NA, 2,~
$ beds <dbl> 0, 2, 2, 7, 1, 3, 0, 2, 2~
$ amenities <chr> "[\"Refrigerator\", \"Hea~
$ price <chr> "$77.00", "$90.00", "$33.~
$ minimum_nights <dbl> 90, 62, 1, 6, 90, 60, 5, ~
$ maximum_nights <dbl> 1125, 1125, 10, 14, 1125,~
$ minimum_minimum_nights <dbl> 33, 62, 1, 6, 90, 60, 5, ~
$ maximum_minimum_nights <dbl> 90, 62, 1, 6, 90, 60, 5, ~
$ minimum_maximum_nights <dbl> 1125, 1125, 10, 14, 1125,~
$ maximum_maximum_nights <dbl> 1125, 1125, 10, 14, 1125,~
$ minimum_nights_avg_ntm <dbl> 88.2, 62.0, 1.0, 6.0, 90.~
$ maximum_nights_avg_ntm <dbl> 1125.0, 1125.0, 10.0, 14.~
$ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30 <dbl> 0, 9, 0, 0, 0, 0, 0, 3, 0~
$ availability_60 <dbl> 21, 9, 0, 0, 1, 0, 4, 31,~
$ availability_90 <dbl> 51, 9, 0, 0, 31, 0, 4, 61~
$ availability_365 <dbl> 326, 93, 0, 0, 102, 144, ~
$ calendar_last_scraped <date> 2021-09-22, 2021-09-22, ~
$ number_of_reviews <dbl> 143, 147, 293, 8, 26, 48,~
$ number_of_reviews_ltm <dbl> 10, 1, 0, 0, 1, 0, 21, 2,~
$ number_of_reviews_l30d <dbl> 1, 0, 0, 0, 0, 0, 3, 0, 0~
$ first_review <date> 2016-04-11, 2010-12-21, ~
$ last_review <date> 2021-07-22, 2017-03-20, ~
$ review_scores_rating <dbl> 4.66, 4.63, 4.83, 5.00, 4~
$ review_scores_accuracy <dbl> 4.79, 4.68, 4.85, 5.00, 5~
$ review_scores_cleanliness <dbl> 4.52, 4.53, 4.90, 5.00, 4~
$ review_scores_checkin <dbl> 4.88, 4.64, 4.86, 5.00, 4~
$ review_scores_communication <dbl> 4.89, 4.69, 4.85, 5.00, 4~
$ review_scores_location <dbl> 4.96, 4.92, 4.91, 4.86, 4~
$ review_scores_value <dbl> 4.59, 4.63, 4.71, 4.86, 4~
$ license <chr> NA, NA, NA, "03/Z/RA/0034~
$ instant_bookable <lgl> FALSE, FALSE, TRUE, FALSE~
$ calculated_host_listings_count <dbl> 5, 1, 1, 1, 3, 2, 1, 1, 2~
$ calculated_host_listings_count_entire_homes <dbl> 5, 1, 0, 1, 3, 2, 1, 1, 2~
$ calculated_host_listings_count_private_rooms <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0~
$ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month <dbl> 2.15, 1.12, 2.40, 0.16, 0~
Using “favstats”, we can get a feel for the values that individual variables take on. We chose “accommodates”, “review_scores_rating”, “number_of_reviews”, and “beds” because our intuitive sense was that these could all impact price in our eventual regression model.
From “favstats”, we learn that the median for accommodates is 2, while the maximum goes up to 16. Also, the average Airbnb rental has a review score of c. 4.6. Finally, there is one Airbnb with 17 beds. These are just a few example figures from this descriptive analysis that help us get a better feel for the data. Also notice that we cannot yet run the command on “price”, since it is still saved as a character variable.
Using “skim”, we can see that there are certain variables where many values are missing (e.g., host_about). It is good to see that “price”, our dependent variable in the regression model, is not missing for any of the rentals.
| variable | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| accommodates | 0 | 2 | 2 | 3 | 16 | 2.714129 | 1.619647 | 18288 | 0 |
| review_scores_rating | 0 | 4.61 | 4.85 | 5 | 5 | 4.626417 | 0.8043395 | 14716 | 3572 |
| number_of_reviews | 0 | 1 | 4 | 17 | 655 | 22.78904 | 51.01942 | 18288 | 0 |
| beds | 0 | 1 | 1 | 2 | 17 | 1.624439 | 1.244291 | 18061 | 227 |
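To make the columns of the summary statistics above concrete, here is a minimal base R sketch that computes the same nine figures. It is an illustration only: `summary_stats()` is a hypothetical helper, not the “favstats” function itself, and the toy data frame uses made-up values.

```r
# Toy stand-in for the listings data (values are made up for illustration).
listings <- data.frame(
  accommodates = c(2, 4, 2, 7, 1, 5),
  beds         = c(0, 2, 2, 7, 1, NA)
)

# Hypothetical helper that mirrors the columns of the summary statistics.
summary_stats <- function(x) {
  q <- quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1),
                na.rm = TRUE, names = FALSE)
  data.frame(
    min = q[1], Q1 = q[2], median = q[3], Q3 = q[4], max = q[5],
    mean    = mean(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    n       = sum(!is.na(x)),   # non-missing observations
    missing = sum(is.na(x))     # missing observations
  )
}

# One row of statistics per variable; row names are the variable names.
stats_by_variable <- do.call(rbind, lapply(listings, summary_stats))
```

Note that `n` counts only non-missing values, which is why `n` and `missing` add up to the number of rows.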
| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | Date.min | Date.max | Date.median | Date.n_unique | logical.mean | logical.count | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | listing_url | 0 | 1.0000000 | 33 | 37 | 0 | 18288 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | name | 29 | 0.9984143 | 1 | 255 | 0 | 17766 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | description | 544 | 0.9702537 | 1 | 1000 | 0 | 17156 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighborhood_overview | 8702 | 0.5241689 | 1 | 1000 | 0 | 8570 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | picture_url | 0 | 1.0000000 | 60 | 126 | 0 | 18047 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_url | 0 | 1.0000000 | 38 | 43 | 0 | 14776 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_name | 16 | 0.9991251 | 1 | 35 | 0 | 5177 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_location | 59 | 0.9967738 | 1 | 199 | 0 | 952 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_about | 9327 | 0.4899934 | 1 | 5095 | 0 | 6642 | 21 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_response_time | 16 | 0.9991251 | 3 | 18 | 0 | 5 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_response_rate | 16 | 0.9991251 | 2 | 4 | 0 | 66 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_acceptance_rate | 16 | 0.9991251 | 2 | 4 | 0 | 99 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_thumbnail_url | 16 | 0.9991251 | 55 | 106 | 0 | 14674 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_picture_url | 16 | 0.9991251 | 57 | 109 | 0 | 14674 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_neighbourhood | 6091 | 0.6669401 | 1 | 28 | 0 | 165 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_verifications | 0 | 1.0000000 | 2 | 158 | 0 | 318 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighbourhood | 8702 | 0.5241689 | 7 | 43 | 0 | 50 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighbourhood_cleansed | 0 | 1.0000000 | 4 | 41 | 0 | 137 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighbourhood_group_cleansed | 0 | 1.0000000 | 5 | 24 | 0 | 12 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | property_type | 0 | 1.0000000 | 3 | 35 | 0 | 68 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | room_type | 0 | 1.0000000 | 10 | 15 | 0 | 4 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | bathrooms_text | 26 | 0.9985783 | 6 | 17 | 0 | 27 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | amenities | 0 | 1.0000000 | 2 | 1416 | 0 | 15257 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | price | 0 | 1.0000000 | 5 | 9 | 0 | 430 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | license | 16019 | 0.1240704 | 3 | 342 | 0 | 1921 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | last_scraped | 0 | 1.0000000 | NA | NA | NA | NA | NA | 2021-09-21 | 2021-10-03 | 2021-09-22 | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | host_since | 16 | 0.9991251 | NA | NA | NA | NA | NA | 2008-08-08 | 2021-09-20 | 2015-09-16 | 3562 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | calendar_last_scraped | 0 | 1.0000000 | NA | NA | NA | NA | NA | 2021-09-21 | 2021-10-03 | 2021-09-22 | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | first_review | 3572 | 0.8046807 | NA | NA | NA | NA | NA | 2010-12-21 | 2021-09-22 | 2018-07-10 | 2771 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | last_review | 3572 | 0.8046807 | NA | NA | NA | NA | NA | 2012-07-08 | 2021-09-26 | 2019-09-28 | 2226 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_is_superhost | 16 | 0.9991251 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.1545534 | FAL: 15448, TRU: 2824 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_has_profile_pic | 16 | 0.9991251 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.9948555 | TRU: 18178, FAL: 94 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_identity_verified | 16 | 0.9991251 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.7887478 | TRU: 14412, FAL: 3860 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | bathrooms | 18288 | 0.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | : | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | calendar_updated | 18288 | 0.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | : | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | has_availability | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.9803696 | TRU: 17929, FAL: 359 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | instant_bookable | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.3035324 | FAL: 12737, TRU: 5551 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | id | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.557156e+07 | 1.540011e+07 | 2.015000e+03 | 1.218794e+07 | 2.385470e+07 | 3.968697e+07 | 5.238006e+07 | ▇▇▇▆▇ |
| numeric | scrape_id | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.021092e+13 | 0.000000e+00 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | ▁▁▇▁▁ |
| numeric | host_id | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.337946e+07 | 1.083088e+08 | 1.581000e+03 | 1.194556e+07 | 4.352120e+07 | 1.449065e+08 | 4.238179e+08 | ▇▂▁▁▁ |
| numeric | host_listings_count | 16 | 0.9991251 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.556042e+00 | 4.036450e+01 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 2.010000e+03 | ▇▁▁▁▁ |
| numeric | host_total_listings_count | 16 | 0.9991251 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.556042e+00 | 4.036450e+01 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 2.010000e+03 | ▇▁▁▁▁ |
| numeric | latitude | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5.250997e+01 | 3.244370e-02 | 5.234007e+01 | 5.248953e+01 | 5.250974e+01 | 5.253325e+01 | 5.265611e+01 | ▁▁▇▃▁ |
| numeric | longitude | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.340509e+01 | 6.332170e-02 | 1.309715e+01 | 1.336797e+01 | 1.341485e+01 | 1.343918e+01 | 1.375736e+01 | ▁▂▇▁▁ |
| numeric | accommodates | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.714129e+00 | 1.619647e+00 | 0.000000e+00 | 2.000000e+00 | 2.000000e+00 | 3.000000e+00 | 1.600000e+01 | ▇▂▁▁▁ |
| numeric | bedrooms | 1609 | 0.9120188 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.271779e+00 | 6.272113e-01 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.200000e+01 | ▇▁▁▁▁ |
| numeric | beds | 227 | 0.9875875 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.624439e+00 | 1.244291e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.700000e+01 | ▇▁▁▁▁ |
| numeric | minimum_nights | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.324256e+00 | 3.423886e+01 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 1.124000e+03 | ▇▁▁▁▁ |
| numeric | maximum_nights | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5.883037e+02 | 5.282968e+02 | 1.000000e+00 | 2.800000e+01 | 3.650000e+02 | 1.125000e+03 | 5.000000e+03 | ▇▇▁▁▁ |
| numeric | minimum_minimum_nights | 1 | 0.9999453 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.220867e+00 | 3.415314e+01 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 1.124000e+03 | ▇▁▁▁▁ |
| numeric | maximum_minimum_nights | 1 | 0.9999453 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.878657e+00 | 3.535246e+01 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 1.124000e+03 | ▇▁▁▁▁ |
| numeric | minimum_maximum_nights | 1 | 0.9999453 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.704186e+05 | 3.175798e+07 | 1.000000e+00 | 3.000000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
| numeric | maximum_maximum_nights | 1 | 0.9999453 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5.878616e+05 | 3.550553e+07 | 1.000000e+00 | 3.000000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
| numeric | minimum_nights_avg_ntm | 1 | 0.9999453 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 9.597167e+00 | 3.451717e+01 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 1.124000e+03 | ▇▁▁▁▁ |
| numeric | maximum_nights_avg_ntm | 1 | 0.9999453 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5.875923e+05 | 3.548948e+07 | 1.000000e+00 | 3.000000e+01 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
| numeric | availability_30 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3.806758e+00 | 7.721591e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | 3.000000e+01 | ▇▁▁▁▁ |
| numeric | availability_60 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.019412e+01 | 1.775509e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.500000e+01 | 6.000000e+01 | ▇▁▁▁▁ |
| numeric | availability_90 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.831753e+01 | 2.927484e+01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.500000e+01 | 9.000000e+01 | ▇▁▁▁▁ |
| numeric | availability_365 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 8.556086e+01 | 1.245070e+02 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.620000e+02 | 3.650000e+02 | ▇▁▁▁▂ |
| numeric | number_of_reviews | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.278904e+01 | 5.101942e+01 | 0.000000e+00 | 1.000000e+00 | 4.000000e+00 | 1.700000e+01 | 6.550000e+02 | ▇▁▁▁▁ |
| numeric | number_of_reviews_ltm | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.679899e+00 | 9.356744e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.000000e+00 | 4.470000e+02 | ▇▁▁▁▁ |
| numeric | number_of_reviews_l30d | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.519904e-01 | 1.654162e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.040000e+02 | ▇▁▁▁▁ |
| numeric | review_scores_rating | 3572 | 0.8046807 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.626417e+00 | 8.043395e-01 | 0.000000e+00 | 4.610000e+00 | 4.850000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_accuracy | 3897 | 0.7869094 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.791855e+00 | 4.112054e-01 | 0.000000e+00 | 4.750000e+00 | 4.920000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_cleanliness | 3895 | 0.7870188 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.637258e+00 | 5.258907e-01 | 0.000000e+00 | 4.500000e+00 | 4.800000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_checkin | 3909 | 0.7862533 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.826007e+00 | 3.900019e-01 | 0.000000e+00 | 4.800000e+00 | 4.960000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_communication | 3898 | 0.7868548 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.828607e+00 | 3.978867e-01 | 0.000000e+00 | 4.810000e+00 | 4.970000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_location | 3908 | 0.7863080 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.759599e+00 | 3.838505e-01 | 0.000000e+00 | 4.670000e+00 | 4.880000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_value | 3910 | 0.7861986 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.668290e+00 | 4.501632e-01 | 0.000000e+00 | 4.550000e+00 | 4.760000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | calculated_host_listings_count | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3.025153e+00 | 7.454440e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 7.600000e+01 | ▇▁▁▁▁ |
| numeric | calculated_host_listings_count_entire_homes | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.942257e+00 | 5.416078e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 4.400000e+01 | ▇▁▁▁▁ |
| numeric | calculated_host_listings_count_private_rooms | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 8.859361e-01 | 3.247792e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 4.500000e+01 | ▇▁▁▁▁ |
| numeric | calculated_host_listings_count_shared_rooms | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.392170e-01 | 2.017200e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.800000e+01 | ▇▁▁▁▁ |
| numeric | reviews_per_month | 3572 | 0.8046807 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 8.155416e-01 | 1.577983e+00 | 1.000000e-02 | 9.000000e-02 | 3.000000e-01 | 1.000000e+00 | 9.086000e+01 | ▇▁▁▁▁ |
In the next step, we transform “price” and some of the other variables into numerics. Also, we use “ggpairs” to get a feel for the correlation between some of the variables. For instance, it is interesting to find out whether “accommodates” correlates with “minimum_nights”. Our intuition was that very large Airbnbs may have a higher minimum_nights number, since the cleaning effort for the host is increased.
The output below indicates that this intuition is not confirmed by the data, since there is actually a slightly negative correlation between minimum_nights and accommodates. As one would expect, a higher number of accommodates is correlated with a higher price. The density plots also help us see that, for example, review_scores_rating is left-skewed, with a large number of rentals having very high ratings. Another interesting observation is that maximum_nights has a peak at 365, which means that many rentals cannot be booked for more than a year. This may be due to regulatory reasons, which keep hosts from renting out their properties for very long periods of time.
These are some other questions that we can now answer:
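The conversion of the character columns can be sketched in base R as follows. This is a minimal illustration under assumptions: the toy data frame reuses the value formats visible in the glimpse output, the "$1,250.00" entry is a made-up value added to show the thousands separator, and the “ggpairs” step is not shown.

```r
# Toy stand-in using the value formats seen in the glimpse output.
listings <- data.frame(
  price = c("$77.00", "$90.00", "$1,250.00"),
  host_acceptance_rate = c("91%", "100%", "N/A"),
  stringsAsFactors = FALSE
)

# Strip the currency symbol and thousands separators, then convert.
listings$price <- as.numeric(gsub("[$,]", "", listings$price))

# Treat "N/A" as missing, drop the percent sign, and rescale to 0-1.
rate <- listings$host_acceptance_rate
rate[rate == "N/A"] <- NA
listings$host_acceptance_rate <-
  as.numeric(sub("%", "", rate, fixed = TRUE)) / 100
```

Replacing "N/A" with `NA` before the conversion avoids the coercion warning that `as.numeric()` would otherwise raise.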
How many variables/columns? How many rows/observations?
There are 74 variables and 18,288 observations.
Which variables are numbers?
The following variables are numbers: id, scrape_id, host_id, latitude, longitude, accommodates, bathrooms, bedrooms, beds, price, maximum_nights, minimum_nights, number_of_reviews, number_of_reviews_ltm, number_of_reviews_l30d, reviews_per_month, calculated_host_listings_count, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, calculated_host_listings_count_shared_rooms.
Which are categorical or factor variables (numeric or character variables that have a fixed and known set of possible values)?
The following variables are factors: host_response_rate, host_acceptance_rate, host_is_superhost, host_has_profile_pic, host_identity_verified, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, instant_bookable.
What are the correlations between variables? Does each scatterplot support a linear relationship between variables? Do any of the correlations appear to be conditional on the value of a categorical variable?
We were not able to observe a strong correlation between any of the variables we selected for testing. It therefore appears that there is no strong linear relationship between price, accommodates, number of reviews, review scores rating, and maximum or minimum nights. We log-transform the price variable at a later stage in order to reduce the higher dispersion among very expensive rentals.
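As a rough sketch of how such correlations and the later log transformation can be computed in base R (toy data with made-up values; `log_price` is a hypothetical column name):

```r
# Toy data (made-up values) standing in for the cleaned listings.
listings <- data.frame(
  price                = c(33, 77, 90, 180, 65, 1200),
  accommodates         = c(2, 2, 4, 7, 1, 5),
  minimum_nights       = c(1, 90, 62, 6, 90, 60),
  review_scores_rating = c(4.83, 4.66, 4.63, 5.00, NA, 4.90)
)

# Pairwise Pearson correlations; for each pair of variables,
# only rows where both values are present are used.
cor_matrix <- cor(listings, use = "pairwise.complete.obs")

# The natural log compresses the long right tail of price.
listings$log_price <- log(listings$price)
```

Because Pearson correlation only measures linear association, a low coefficient does not rule out a non-linear relationship.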
We are now at the third step of the Exploratory Data Analysis section.
In this step, we plot some graphs in order to deepen our understanding of how different variables are distributed. We do not exclusively focus on variables and relationships that may impact price in our regression model, but rather try to get a feel of the dataset in general.
In the first chart, we learn that the distribution of beds varies with the number of guests a rental accommodates. This is a rather straightforward relationship, but it helps to start with something that confirms the intuition. In general, the interquartile range increases with the number of guests an Airbnb accommodates. One can assume that this is due to extra beds in the form of sofa beds, which are likely more frequent in larger rentals. These more “improvised” beds are less likely to be found in smaller rentals.
The second chart tells us that superhosts (experienced hosts who meet Airbnb's performance criteria) have a higher median review rating and a smaller interquartile range. One can assume that superhosts more consistently provide a high-quality rental experience and therefore the spread of ratings is smaller. We can also see that there are certain rentals for which the data set does not provide information on host status (“NA”).
The third chart shows that review ratings among different room types vary. Shared rooms tend to have the worst ratings, which is likely due to the fact that the rental experience is dependent on another visitor.
The fourth chart shows the availability of rentals in different neighbourhoods. For example, in “Mitte” the availability is a lot lower than in Spandau. This is likely due to the fact that Mitte is in a very central location, where the demand for Airbnbs is really high.
For the fifth chart, we filter out all the rentals that have a price >400 to avoid the distorting effect of very expensive rentals. In a later step, we will log-transform the price variable to achieve this. For now, the chart tells us that different room types have different price distributions. The hotel room category, where you also pay for using the amenities of the respective hotel, is unsurprisingly the most expensive one. What is more interesting is that shared rooms and private rooms have very similar distributions. One reason could be that shared rooms are a lot larger, which makes up for the lack of privacy in terms of price.
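A chart of this kind can be sketched with ggplot2 as follows. The data frame is a toy example with made-up prices, and the choice of a box plot is an assumption about the chart type.

```r
library(ggplot2)

# Toy data (made-up values) with one very expensive rental.
listings <- data.frame(
  room_type = c("Entire home/apt", "Private room", "Hotel room",
                "Shared room", "Entire home/apt", "Private room"),
  price = c(77, 33, 150, 30, 950, 45)
)

# Keep only rentals priced at 400 or less.
listings_capped <- subset(listings, price <= 400)

# Box plot of price by room type; the plot is drawn when printed.
price_plot <- ggplot(listings_capped, aes(x = room_type, y = price)) +
  geom_boxplot() +
  labs(x = "Room type", y = "Price")
```

Filtering before plotting keeps a handful of extreme prices from stretching the y-axis and flattening the boxes.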
From the sixth chart, we learn that whether a host has a profile picture seems to impact the communication rating for a specific rental. Hosts that have a picture tend to score higher in this category. After all, Airbnb customers seem to like to see who their host is and incorporate that into the communication rating they give.
Now, we focus on getting our data set in the right format for our regression analysis.
First, we look at the variable property_type. We can use the count function to determine how many categories there are and their frequency. The four most common property types are entire rental units (~48.0%), private rooms in rental units (~35.7%), entire condominiums (~2.7%), and entire serviced apartments (~2.0%). Together, these property types make up ~88.4% of the whole sample.
| property_type | count | prop_in_percentage |
|---|---|---|
| Entire rental unit | 8778 | 47.9986877 |
| Private room in rental unit | 6534 | 35.7283465 |
| Entire condominium (condo) | 485 | 2.6520122 |
| Entire serviced apartment | 362 | 1.9794401 |
| Entire loft | 327 | 1.7880577 |
| Private room in residential home | 237 | 1.2959318 |
| Private room in condominium (condo) | 219 | 1.1975066 |
| Entire residential home | 183 | 1.0006562 |
| Room in hotel | 175 | 0.9569116 |
| Shared room in rental unit | 117 | 0.6397638 |
| Room in boutique hotel | 96 | 0.5249344 |
| Private room in loft | 80 | 0.4374453 |
| Shared room in hostel | 75 | 0.4101050 |
| Private room in bed and breakfast | 68 | 0.3718285 |
| Entire guesthouse | 56 | 0.3062117 |
| Private room in townhouse | 55 | 0.3007437 |
| Private room in hostel | 48 | 0.2624672 |
| Room in serviced apartment | 47 | 0.2569991 |
| Entire guest suite | 32 | 0.1749781 |
| Room in aparthotel | 31 | 0.1695101 |
| Private room in serviced apartment | 29 | 0.1585739 |
| Entire bungalow | 25 | 0.1367017 |
| Entire townhouse | 24 | 0.1312336 |
| Houseboat | 18 | 0.0984252 |
| Private room | 18 | 0.0984252 |
| Private room in guest suite | 13 | 0.0710849 |
| Private room in pension | 13 | 0.0710849 |
| Entire place | 10 | 0.0546807 |
| Boat | 9 | 0.0492126 |
| Camper/RV | 9 | 0.0492126 |
| Private room in guesthouse | 9 | 0.0492126 |
| Room in hostel | 9 | 0.0492126 |
| Private room in villa | 8 | 0.0437445 |
| Tiny house | 8 | 0.0437445 |
| Entire villa | 7 | 0.0382765 |
| Entire cabin | 6 | 0.0328084 |
| Private room in casa particular | 6 | 0.0328084 |
| Private room in tiny house | 5 | 0.0273403 |
| Shared room in condominium (condo) | 5 | 0.0273403 |
| Entire cottage | 4 | 0.0218723 |
| Private room in boat | 4 | 0.0218723 |
| Private room in bungalow | 4 | 0.0218723 |
| Shared room in boutique hotel | 4 | 0.0218723 |
| Shared room in loft | 3 | 0.0164042 |
| Shared room in residential home | 3 | 0.0164042 |
| Entire home/apt | 2 | 0.0109361 |
| Private room in cottage | 2 | 0.0109361 |
| Room in bed and breakfast | 2 | 0.0109361 |
| Shared room in bed and breakfast | 2 | 0.0109361 |
| Shared room in serviced apartment | 2 | 0.0109361 |
| Shared room in tiny house | 2 | 0.0109361 |
| Treehouse | 2 | 0.0109361 |
| Bus | 1 | 0.0054681 |
| Casa particular | 1 | 0.0054681 |
| Castle | 1 | 0.0054681 |
| Earth house | 1 | 0.0054681 |
| Entire chalet | 1 | 0.0054681 |
| Floor | 1 | 0.0054681 |
| Island | 1 | 0.0054681 |
| Private room in cave | 1 | 0.0054681 |
| Private room in floor | 1 | 0.0054681 |
| Private room in houseboat | 1 | 0.0054681 |
| Private room in tipi | 1 | 0.0054681 |
| Shared room | 1 | 0.0054681 |
| Shared room in boat | 1 | 0.0054681 |
| Shared room in cabin | 1 | 0.0054681 |
| Shared room in townhouse | 1 | 0.0054681 |
| Shipping container | 1 | 0.0054681 |
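A frequency table like the one above can be produced with dplyr. The following is a minimal sketch on toy data; the column names `count` and `prop_in_percentage` follow the table header.

```r
library(dplyr)

# Toy data: ten listings across four property types.
listings <- data.frame(
  property_type = c(rep("Entire rental unit", 5),
                    rep("Private room in rental unit", 3),
                    "Entire loft", "Houseboat"),
  stringsAsFactors = FALSE
)

# Count listings per type, most frequent first, and add the share in percent.
property_counts <- listings %>%
  count(property_type, name = "count", sort = TRUE) %>%
  mutate(prop_in_percentage = count / sum(count) * 100)
```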
Since the vast majority of the observations in the data belong to one of the top four or five property types, we would like to create a simplified version of the property_type variable that has 5 categories: the top four categories and Other.
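One way to build such a variable is sketched below with dplyr on toy data. The vector `top_four` is a hypothetical helper name; `forcats::fct_lump_n()` would be an alternative that picks the most frequent levels automatically.

```r
library(dplyr)

# Toy data: the four most common types plus two rarer ones.
listings <- data.frame(
  property_type = c("Entire rental unit", "Private room in rental unit",
                    "Entire condominium (condo)", "Entire serviced apartment",
                    "Entire loft", "Houseboat"),
  stringsAsFactors = FALSE
)

top_four <- c("Entire rental unit", "Private room in rental unit",
              "Entire condominium (condo)", "Entire serviced apartment")

# Keep the top four types as they are and collapse the rest into "Other".
listings <- listings %>%
  mutate(prop_type_simplified = if_else(property_type %in% top_four,
                                        property_type, "Other"))
```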
We can quickly check if the simplification worked.
| property_type | prop_type_simplified | n |
|---|---|---|
| Entire rental unit | Entire rental unit | 8778 |
| Private room in rental unit | Private room in rental unit | 6534 |
| Entire condominium (condo) | Entire condominium (condo) | 485 |
| Entire serviced apartment | Entire serviced apartment | 362 |
| Entire loft | Other | 327 |
| Private room in residential home | Other | 237 |
| Private room in condominium (condo) | Other | 219 |
| Entire residential home | Other | 183 |
| Room in hotel | Other | 175 |
| Shared room in rental unit | Other | 117 |
| Room in boutique hotel | Other | 96 |
| Private room in loft | Other | 80 |
| Shared room in hostel | Other | 75 |
| Private room in bed and breakfast | Other | 68 |
| Entire guesthouse | Other | 56 |
| Private room in townhouse | Other | 55 |
| Private room in hostel | Other | 48 |
| Room in serviced apartment | Other | 47 |
| Entire guest suite | Other | 32 |
| Room in aparthotel | Other | 31 |
| Private room in serviced apartment | Other | 29 |
| Entire bungalow | Other | 25 |
| Entire townhouse | Other | 24 |
| Houseboat | Other | 18 |
| Private room | Other | 18 |
| Private room in guest suite | Other | 13 |
| Private room in pension | Other | 13 |
| Entire place | Other | 10 |
| Boat | Other | 9 |
| Camper/RV | Other | 9 |
| Private room in guesthouse | Other | 9 |
| Room in hostel | Other | 9 |
| Private room in villa | Other | 8 |
| Tiny house | Other | 8 |
| Entire villa | Other | 7 |
| Entire cabin | Other | 6 |
| Private room in casa particular | Other | 6 |
| Private room in tiny house | Other | 5 |
| Shared room in condominium (condo) | Other | 5 |
| Entire cottage | Other | 4 |
| Private room in boat | Other | 4 |
| Private room in bungalow | Other | 4 |
| Shared room in boutique hotel | Other | 4 |
| Shared room in loft | Other | 3 |
| Shared room in residential home | Other | 3 |
| Entire home/apt | Other | 2 |
| Private room in cottage | Other | 2 |
| Room in bed and breakfast | Other | 2 |
| Shared room in bed and breakfast | Other | 2 |
| Shared room in serviced apartment | Other | 2 |
| Shared room in tiny house | Other | 2 |
| Treehouse | Other | 2 |
| Bus | Other | 1 |
| Casa particular | Other | 1 |
| Castle | Other | 1 |
| Earth house | Other | 1 |
| Entire chalet | Other | 1 |
| Floor | Other | 1 |
| Island | Other | 1 |
| Private room in cave | Other | 1 |
| Private room in floor | Other | 1 |
| Private room in houseboat | Other | 1 |
| Private room in tipi | Other | 1 |
| Shared room | Other | 1 |
| Shared room in boat | Other | 1 |
| Shared room in cabin | Other | 1 |
| Shared room in townhouse | Other | 1 |
| Shipping container | Other | 1 |
Next, we look at the minimum_nights variable so that our regression analysis only includes listings that are intended for travel purposes. First, we check the distribution of minimum_nights.
| minimum_nights | count |
|---|---|
| 2 | 4236 |
| 1 | 4194 |
| 3 | 3282 |
| 4 | 1368 |
| 5 | 1293 |
| 7 | 864 |
| 30 | 418 |
| 6 | 382 |
| 14 | 363 |
| 60 | 298 |
| 10 | 284 |
| 90 | 195 |
| 20 | 117 |
| 28 | 95 |
| 15 | 82 |
| 8 | 75 |
| 21 | 72 |
| 180 | 53 |
| 12 | 43 |
| 9 | 39 |
| 25 | 37 |
| 13 | 33 |
| 29 | 32 |
| 61 | 31 |
| 62 | 28 |
| 22 | 26 |
| 120 | 23 |
| 31 | 17 |
| 183 | 16 |
| 45 | 14 |
| 93 | 13 |
| 18 | 11 |
| 150 | 11 |
| 16 | 10 |
| 89 | 10 |
| 91 | 10 |
| 40 | 9 |
| 58 | 9 |
| 100 | 9 |
| 357 | 9 |
| 11 | 7 |
| 19 | 7 |
| 50 | 7 |
| 56 | 7 |
| 365 | 7 |
| 23 | 6 |
| 27 | 6 |
| 181 | 6 |
| 65 | 5 |
| 92 | 5 |
| 200 | 5 |
| 1000 | 5 |
| 17 | 4 |
| 55 | 4 |
| 63 | 4 |
| 70 | 4 |
| 80 | 4 |
| 85 | 4 |
| 300 | 4 |
| 24 | 3 |
| 42 | 3 |
| 59 | 3 |
| 99 | 3 |
| 118 | 3 |
| 182 | 3 |
| 186 | 3 |
| 500 | 3 |
| 1124 | 3 |
| 26 | 2 |
| 33 | 2 |
| 35 | 2 |
| 83 | 2 |
| 84 | 2 |
| 140 | 2 |
| 185 | 2 |
| 240 | 2 |
| 360 | 2 |
| 34 | 1 |
| 37 | 1 |
| 48 | 1 |
| 49 | 1 |
| 51 | 1 |
| 71 | 1 |
| 75 | 1 |
| 82 | 1 |
| 87 | 1 |
| 88 | 1 |
| 98 | 1 |
| 101 | 1 |
| 105 | 1 |
| 119 | 1 |
| 122 | 1 |
| 125 | 1 |
| 128 | 1 |
| 129 | 1 |
| 170 | 1 |
| 179 | 1 |
| 184 | 1 |
| 187 | 1 |
| 188 | 1 |
| 210 | 1 |
| 250 | 1 |
| 270 | 1 |
| 304 | 1 |
| 355 | 1 |
| 356 | 1 |
| 720 | 1 |
| 1100 | 1 |
We can now answer some more questions:
What are the most common values for the variable minimum_nights?
The most common values for the variable minimum_nights are 2, 1, and 3 nights. This makes sense: many people use Airbnb for city trips, so hosts keep the minimum stay short. At the same time, the cost and effort of cleaning an Airbnb after a one-night booking might not be worth it for many hosts.
Is there any value among the common values that stands out?
The 30-, 14- and 60-night minimums stand out at first glance. These are usually longer-term Airbnbs that are used by interns or by workers on assembly trips. Renting out over the longer term is also logical for some landlords, because the room only has to be tidied once per stay. The highest minimum-night requirement is 1,124 nights. This observation should be investigated further to understand the reason behind such a high value.
What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?
The usual reason for these longer minimum stays is to attract bookings from people who are on work projects or internships, or who need a temporary stay while looking for permanent accommodation. The benefit for the host is that the rooms have to be cleaned and set up less often.
Next, we filter the airbnb data so that it only includes observations with minimum_nights <= 4.
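The filtering step can be sketched as follows. The data frame `toy` and its values are invented for illustration; only the column name minimum_nights and the threshold of 4 come from the text.

```r
# Toy minimum_nights values (illustrative only)
toy <- data.frame(minimum_nights = c(1, 2, 2, 3, 4, 5, 7, 30, 60, 1124))

# Frequency table sorted from the most to the least common value
print(sort(table(toy$minimum_nights), decreasing = TRUE))

# Keep only listings that can be booked for a 4-night trip
toy_travel <- toy[toy$minimum_nights <= 4, , drop = FALSE]
print(nrow(toy_travel))
```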
After making these adjustments, we want to analyze the distribution of rentals in Berlin. As the chart below shows, certain quarters have a particularly high number of rentals. For instance, in Kreuzberg (a southern quarter in the city), there are many rentals available. This may be due to the types of buildings and the general infrastructure in the area. Kreuzberg is home to many restaurants and bars, which makes it an interesting area for tourists. Interestingly, there are fewer Airbnbs in the heart of the city. This is likely because the political district and many high-end hotels are located here, which leaves less room for Airbnbs.
# Data visualization that places each rental on the map using its longitude and latitude
leaflet(data = filter(listings, minimum_nights <= 4)) %>%
  addProviderTiles("OpenStreetMap.Mapnik") %>%
  addCircleMarkers(lng = ~longitude,
                   lat = ~latitude,
                   radius = 0.5,
                   fillColor = "red",
                   fillOpacity = 0.3,
                   popup = ~listing_url,      # clicking a marker shows the listing URL
                   label = ~property_type)    # hovering shows the property type

As we get closer to our regression model, we create a new variable called price_4_nights that uses price and accommodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.
In the next section, we create a new column called “log(price_4_nights)”. We use log(price_4_nights) because there are some outlier rentals in price_4_nights, and the log transformation pulls in these extreme values. In addition, the log makes the distribution better behaved and helps with finding the regression model. Inference in the regression model assumes approximately normal errors with constant variance, and the log transformation helps us come closer to both assumptions, although it does not guarantee them.
We can use histograms to examine the distributions of price_4_nights and log(price_4_nights).
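The effect of the log transformation can be illustrated with a short sketch. The prices below are simulated from a log-normal distribution and are not the Airbnb data; the helper `skewness` is a hypothetical name defined here for illustration.

```r
set.seed(1)
# Simulated skewed prices (log-normal); not the real Airbnb data
price_4_nights <- exp(rnorm(1000, mean = 5.3, sd = 0.5))

# Simple moment-based skewness; values near 0 indicate a symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(price_4_nights)
skew_log <- skewness(log(price_4_nights))
print(c(raw = skew_raw, log = skew_log))

# Histograms of the raw and the log-transformed variable side by side
op <- par(mfrow = c(1, 2))
hist(price_4_nights, breaks = 40, main = "price_4_nights", xlab = "EUR")
hist(log(price_4_nights), breaks = 40, main = "log(price_4_nights)", xlab = "log(EUR)")
par(op)
```

On data like this, the raw histogram has a long right tail, while the log-transformed one is roughly bell-shaped.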
We now have all variables in the correct format and can start model selection and validation. We start with a model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating. Before running the first model, we split the data into a training and a testing part. This is key for testing the explanatory power of the model on data that it has not been trained on (more on this later).
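A minimal sketch of the split-and-fit step is shown below. The data are simulated, and the seed, the 75/25 split ratio, and the names `toy`, `train`, `test` and `model_sketch` are assumptions made for illustration; the text does not state which ratio the original analysis used.

```r
set.seed(1234)  # arbitrary seed so that the split is reproducible

# Simulated stand-in for the cleaned listings data (not the real data)
n <- 500
toy <- data.frame(
  number_of_reviews    = rpois(n, 20),
  review_scores_rating = runif(n, 3, 5)
)
toy$price_4_nights <- exp(5 + 0.05 * toy$review_scores_rating + rnorm(n, sd = 0.4))

# Random 75/25 split into training and testing rows (the ratio is an assumption)
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit the model on the training data only; the testing data stay untouched
model_sketch <- lm(log(price_4_nights) ~ number_of_reviews + review_scores_rating,
                   data = train)
print(summary(model_sketch)$coefficients)
```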
Estimate Std. Error t value
(Intercept) 5.253e+00 5.471e-02 96.015
prop_type_simplifiedEntire rental unit -1.236e-01 3.786e-02 -3.266
prop_type_simplifiedEntire serviced apartment 3.699e-01 5.870e-02 6.301
prop_type_simplifiedOther -1.397e-01 4.069e-02 -3.434
prop_type_simplifiedPrivate room in rental unit -5.145e-01 3.817e-02 -13.478
number_of_reviews -2.629e-04 9.906e-05 -2.654
review_scores_rating 4.414e-02 8.538e-03 5.170
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire rental unit 0.001098 **
prop_type_simplifiedEntire serviced apartment 3.14e-10 ***
prop_type_simplifiedOther 0.000597 ***
prop_type_simplifiedPrivate room in rental unit < 2e-16 ***
number_of_reviews 0.007968 **
review_scores_rating 2.41e-07 ***
Residual standard error: 0.489 on 6786 degrees of freedom
Multiple R-squared: 0.1531, Adjusted R-squared: 0.1524
F-statistic: 204.5 on 6 and 6786 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.012987 4 1.001614
number_of_reviews 1.015617 1 1.007778
review_scores_rating 1.006425 1 1.003207
Because the dependent variable (i.e., price_4_nights) is log-transformed, the interpretation of the coefficients requires one additional step. The coefficient has to be exponentiated to reverse the log-transformation: \((e^{4.41 \times 10^{-2}} - 1) \times 100 \approx 4.51\). This adjusted coefficient means that for every one-unit increase in review_scores_rating, price_4_nights increases by about 4.5%, holding the other variables constant. This makes intuitive sense: the higher the rating, the more the host can charge. The t-value of 5.17 indicates that this relationship is statistically significant.
To interpret the coefficients of prop_type_simplified, they have to be transformed in the same way as above. This leads to the following percentage changes:
prop_type_simplifiedEntire rental unit: -11.66
prop_type_simplifiedEntire serviced apartment: 44.77
prop_type_simplifiedOther: -13.06
prop_type_simplifiedPrivate room in rental unit: -40.19
The category “Entire condominium (condo)” is taken as the base level. Hence, the coefficients correspond to the %-change in price_4_nights relative to the base case that the Airbnb is of prop_type “Entire condominium (condo)”. For instance, if you rent an “Entire serviced apartment”, price_4_nights is about 45% higher than if you had rented an “Entire condominium (condo)”. The same logic applies to the other categories, which are also all statistically significant. It also makes intuitive sense that, for example, “Entire serviced apartments” are significantly more costly, because you pay for amenities such as regular cleaning or even breakfast. In a further analysis, one could split up the “Other” category further to find out more about other property types.
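The transformation described above can be reproduced with a few lines of R. The coefficients are copied from the printed model1 output; the vector names `coefs` and `pct_change` are illustrative. Because the printed coefficients are rounded, the results can differ slightly from the figures quoted in the text.

```r
# Log-scale coefficients copied from the model1 output above
coefs <- c(
  review_scores_rating          = 4.414e-02,
  "Entire rental unit"          = -1.236e-01,
  "Entire serviced apartment"   = 3.699e-01,
  "Other"                       = -1.397e-01,
  "Private room in rental unit" = -5.145e-01
)

# Percentage change in price_4_nights for a one-unit change
# (or, for the categories, relative to the base level)
pct_change <- (exp(coefs) - 1) * 100
print(round(pct_change, 2))
```

Note that the percentage effects are not symmetric: a coefficient of -0.51 corresponds to a decrease of roughly 40%, not 51%, which is why the exponentiation step matters for the larger coefficients.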
Next, we want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. We fit a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.
| room_type | count |
|---|---|
| Entire home/apt | 3782 |
| Private room | 2869 |
| Hotel room | 76 |
| Shared room | 66 |
Estimate Std. Error t value
(Intercept) 5.299e+00 5.259e-02 100.771
prop_type_simplifiedEntire rental unit -1.243e-01 3.637e-02 -3.417
prop_type_simplifiedEntire serviced apartment 3.678e-01 5.639e-02 6.523
prop_type_simplifiedOther 9.154e-03 4.488e-02 0.204
prop_type_simplifiedPrivate room in rental unit -2.505e-01 5.186e-02 -4.829
number_of_reviews -2.319e-04 9.522e-05 -2.435
review_scores_rating 3.404e-02 8.213e-03 4.144
room_typeHotel room 6.854e-01 6.056e-02 11.318
room_typePrivate room -2.643e-01 3.673e-02 -7.197
room_typeShared room -1.120e+00 6.413e-02 -17.465
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire rental unit 0.000637 ***
prop_type_simplifiedEntire serviced apartment 7.38e-11 ***
prop_type_simplifiedOther 0.838402
prop_type_simplifiedPrivate room in rental unit 1.40e-06 ***
number_of_reviews 0.014915 *
review_scores_rating 3.45e-05 ***
room_typeHotel room < 2e-16 ***
room_typePrivate room 6.84e-13 ***
room_typeShared room < 2e-16 ***
Residual standard error: 0.4697 on 6783 degrees of freedom
Multiple R-squared: 0.219, Adjusted R-squared: 0.218
F-statistic: 211.3 on 9 and 6783 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 11.910891 4 1.362991
number_of_reviews 1.017097 1 1.008512
review_scores_rating 1.009314 1 1.004646
room_type 11.887313 3 1.510708
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.2260167 0.0396760 131.717 < 2e-16 ***
number_of_reviews -0.0002115 0.0000967 -2.187 0.028743 *
review_scores_rating 0.0295956 0.0083487 3.545 0.000395 ***
room_typeHotel room 0.7884778 0.0553871 14.236 < 2e-16 ***
room_typePrivate room -0.3873318 0.0118796 -32.605 < 2e-16 ***
room_typeShared room -1.0191393 0.0594124 -17.154 < 2e-16 ***
Residual standard error: 0.4779 on 6787 degrees of freedom
Multiple R-squared: 0.1908, Adjusted R-squared: 0.1902
F-statistic: 320 on 5 and 6787 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.012971 1 1.006464
review_scores_rating 1.007178 1 1.003583
room_type 1.010981 3 1.001822
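There is some multicollinearity between room_type and property_type, as one would expect: when both are included, each has a GVIF of about 11.9. Because room_type adds more explanatory power to the model (adjusted R-squared of 0.190 with room_type versus 0.152 with prop_type_simplified), we exclude property_type from the model. In the refitted model, all room_type coefficients are statistically significant: relative to the base category “Entire home/apt”, hotel rooms are more expensive, while private rooms and shared rooms are cheaper.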
We now add further variables to the model to increase its explanatory power. Currently, our model explains only c. 19% of the variation in price. Model3 therefore adds the number of bedrooms, the number of beds, and the guest capacity (accommodates) of a rental.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.416e+00 4.046e-02 133.838 < 2e-16 ***
number_of_reviews 1.058e-04 9.369e-05 1.129 0.25879
review_scores_rating 2.466e-02 8.000e-03 3.082 0.00206 **
room_typeHotel room 7.603e-01 5.311e-02 14.315 < 2e-16 ***
room_typePrivate room -4.753e-01 1.230e-02 -38.646 < 2e-16 ***
room_typeShared room -8.660e-01 5.841e-02 -14.826 < 2e-16 ***
bedrooms 1.699e-01 1.249e-02 13.610 < 2e-16 ***
beds 5.877e-03 6.778e-03 0.867 0.38589
accommodates -1.202e-01 6.123e-03 -19.624 < 2e-16 ***
Residual standard error: 0.4578 on 6784 degrees of freedom
Multiple R-squared: 0.258, Adjusted R-squared: 0.2571
F-statistic: 294.8 on 8 and 6784 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.036525 1 1.018099
review_scores_rating 1.008184 1 1.004084
room_type 1.247866 3 1.037595
bedrooms 2.291741 1 1.513850
beds 2.975669 1 1.725013
accommodates 3.931295 1 1.982749
Based on this model, we learn that bedrooms and accommodates are significant predictors of price_4_nights, which can be seen from their high absolute t-statistics. As the number of bedrooms increases, the price of the rental also increases. As the capacity increases, the price per person actually decreases (remember that we divided by “accommodates” when adjusting the price_4_nights variable). This makes sense, since the price is then shared among a greater number of guests. Beds is not a statistically significant predictor of price_4_nights. Interestingly, there is some multicollinearity between bedrooms, beds, and accommodates (GVIF values between 2.3 and 3.9), but not enough to disregard the model.
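To make the multicollinearity check more concrete, the sketch below computes variance inflation factors by hand. For a numeric predictor with one degree of freedom, the GVIF reported above equals the usual VIF, \(1 / (1 - R_j^2)\), where \(R_j^2\) comes from regressing that predictor on the other predictors. The data are simulated to be correlated on purpose, and the name `vif_manual` is illustrative.

```r
set.seed(42)
# Simulated, deliberately correlated size variables (not the real data)
n <- 1000
bedrooms     <- rpois(n, 1) + 1
beds         <- bedrooms + rbinom(n, 1, 0.3)
accommodates <- 2 * bedrooms + rbinom(n, 2, 0.5)
X <- data.frame(bedrooms, beds, accommodates)

# VIF of predictor v: regress it on the other predictors, then 1 / (1 - R^2)
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

A VIF of 1 means that a predictor is uncorrelated with the others; the larger the value, the more the standard error of its coefficient is inflated.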
Comparing Model3 to Model2, the adjusted R-squared increases to 0.26, which means that we can now explain more than a quarter of the variation in price. In Model4, we add the superhost variable (host_is_superhost) and check whether superhosts can command a pricing premium after controlling for the other variables.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.439329 0.040453 134.459 < 2e-16 ***
number_of_reviews -0.000150 0.000100 -1.500 0.1337
review_scores_rating 0.017755 0.008030 2.211 0.0271 *
room_typeHotel room 0.752380 0.052930 14.214 < 2e-16 ***
room_typePrivate room -0.476599 0.012256 -38.887 < 2e-16 ***
room_typeShared room -0.843527 0.058286 -14.472 < 2e-16 ***
bedrooms 0.170023 0.012441 13.667 < 2e-16 ***
beds 0.005519 0.006753 0.817 0.4138
accommodates -0.121391 0.006104 -19.889 < 2e-16 ***
host_is_superhostTRUE 0.107250 0.015051 7.126 1.14e-12 ***
Residual standard error: 0.4561 on 6783 degrees of freedom
Multiple R-squared: 0.2635, Adjusted R-squared: 0.2625
F-statistic: 269.6 on 9 and 6783 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.189823 1 1.090790
review_scores_rating 1.023068 1 1.011468
room_type 1.252474 3 1.038233
bedrooms 2.291743 1 1.513850
beds 2.975834 1 1.725061
accommodates 3.934404 1 1.983533
host_is_superhost 1.188409 1 1.090142
Based on this model, superhosts charge a pricing premium, which can be seen from the positive coefficient and the high t-statistic. This makes sense, since these kinds of hosts are typically very professional in the way that they manage their apartments, which translates into higher customer value and thereby the ability to charge higher prices.
For Model5, we include the fact that some hosts allow you to immediately book their listing (instant_bookable == TRUE), while a non-trivial proportion don’t.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.4171221 0.0406347 133.313 < 2e-16 ***
number_of_reviews -0.0001967 0.0001003 -1.962 0.0499 *
review_scores_rating 0.0198574 0.0080273 2.474 0.0134 *
room_typeHotel room 0.7134056 0.0534283 13.353 < 2e-16 ***
room_typePrivate room -0.4778254 0.0122375 -39.046 < 2e-16 ***
room_typeShared room -0.8418858 0.0581870 -14.469 < 2e-16 ***
bedrooms 0.1716897 0.0124241 13.819 < 2e-16 ***
beds 0.0054498 0.0067418 0.808 0.4189
accommodates -0.1235199 0.0061083 -20.222 < 2e-16 ***
host_is_superhostTRUE 0.1045982 0.0150346 6.957 3.80e-12 ***
instant_bookableTRUE 0.0591527 0.0120019 4.929 8.48e-07 ***
Residual standard error: 0.4553 on 6782 degrees of freedom
Multiple R-squared: 0.2661, Adjusted R-squared: 0.265
F-statistic: 245.9 on 10 and 6782 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.200550 1 1.095696
review_scores_rating 1.025965 1 1.012899
room_type 1.280649 3 1.042089
bedrooms 2.293443 1 1.514412
beds 2.975847 1 1.725064
accommodates 3.954181 1 1.988512
host_is_superhost 1.189933 1 1.090840
instant_bookable 1.055243 1 1.027250
As can be seen from the summary statistics, the variable “instant_bookable” is also a significant predictor of price_4_nights. The regression analysis reveals that when controlling for the other listed variables, a rental with an instant-booking option is c. 6.09% more expensive than one without. The customer pays a premium for instant confirmation that the rental can be booked. The t-statistic for the variable is high and there is little multicollinearity with other variables, which is why it should be kept in the model.
For Model6, we look at neighbourhoods. For all cities, there are 3 variables that relate to neighbourhoods: neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighbourhoods in each city, and it wouldn’t make sense to include them all in the model. Instead, we manipulate the neighbourhood_group_cleansed variable and divide neighbourhoods into the following 4 groups:
City West: Steglitz - Zehlendorf, Spandau, Charlottenburg-Wilm.
City North: Reinickendorf, Pankow, Lichtenberg
City Central: Mitte, Friedrichshain-Kreuzberg
City East: Marzahn - Hellersdorf, Treptow - Köpenick, Neukölln, Tempelhof - Schöneberg
This grouping is based on (i) the geographic location of the neighbourhoods and (ii) the judgement of a Berlin local. It pays special attention to the particularly sought-after quarters of “Mitte” and “Friedrichshain-Kreuzberg”, which form their own group.
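One way to implement this grouping is a named lookup vector, sketched below. The group labels and neighbourhood names are taken from the list above; the data frame `toy` and the name `area_lookup` are made up for illustration, and the original analysis may have built the variable differently.

```r
# Lookup from neighbourhood_group_cleansed to the four areas listed above
area_lookup <- c(
  "Steglitz - Zehlendorf"    = "City West",
  "Spandau"                  = "City West",
  "Charlottenburg-Wilm."     = "City West",
  "Reinickendorf"            = "City North",
  "Pankow"                   = "City North",
  "Lichtenberg"              = "City North",
  "Mitte"                    = "City Central",
  "Friedrichshain-Kreuzberg" = "City Central",
  "Marzahn - Hellersdorf"    = "City East",
  "Treptow - Köpenick"       = "City East",
  "Neukölln"                 = "City East",
  "Tempelhof - Schöneberg"   = "City East"
)

# Toy listings (illustrative only)
toy <- data.frame(
  neighbourhood_group_cleansed = c("Mitte", "Pankow", "Neukölln", "Spandau"),
  stringsAsFactors = FALSE
)

# Map each listing to its area. factor() sorts the levels alphabetically,
# so "City Central" becomes the first level and thus the base category in lm()
toy$areas <- factor(unname(area_lookup[toy$neighbourhood_group_cleansed]))
print(toy)
print(levels(toy$areas))
```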
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.481e+00 4.065e-02 134.819 < 2e-16 ***
number_of_reviews -2.709e-04 9.943e-05 -2.725 0.00645 **
review_scores_rating 1.979e-02 7.943e-03 2.492 0.01272 *
room_typeHotel room 7.074e-01 5.289e-02 13.376 < 2e-16 ***
room_typePrivate room -4.746e-01 1.213e-02 -39.121 < 2e-16 ***
room_typeShared room -8.705e-01 5.763e-02 -15.104 < 2e-16 ***
bedrooms 1.759e-01 1.231e-02 14.291 < 2e-16 ***
beds 5.963e-03 6.681e-03 0.893 0.37211
accommodates -1.250e-01 6.057e-03 -20.640 < 2e-16 ***
host_is_superhostTRUE 1.058e-01 1.489e-02 7.106 1.31e-12 ***
instant_bookableTRUE 5.999e-02 1.188e-02 5.049 4.55e-07 ***
areasCity East -1.660e-01 1.381e-02 -12.018 < 2e-16 ***
areasCity North -8.926e-02 1.459e-02 -6.116 1.01e-09 ***
areasCity West -5.389e-02 1.914e-02 -2.815 0.00490 **
Residual standard error: 0.4505 on 6779 degrees of freedom
Multiple R-squared: 0.282, Adjusted R-squared: 0.2806
F-statistic: 204.8 on 13 and 6779 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.205798 1 1.098088
review_scores_rating 1.026320 1 1.013075
room_type 1.290673 3 1.043445
bedrooms 2.299866 1 1.516531
beds 2.985379 1 1.727825
accommodates 3.972096 1 1.993012
host_is_superhost 1.192803 1 1.092155
instant_bookable 1.056423 1 1.027824
areas 1.025144 3 1.004147
The regression table confirms that neighbourhood is indeed a significant driver of price. City Central is the base category for the analysis and is therefore omitted from the model output. Relative to this base case, all other areas are cheaper. For example, an Airbnb in City East will be c. 15.3% less expensive than a comparable apartment in City Central. Taking Berlin’s history into account, this makes sense: the eastern part of the city belonged to the former GDR (East Germany), where prices tend to be lower.
For Model7, we include the effect of availability_30 and reviews_per_month on price.
The variable “availability_30” is also a significant predictor of price_4_nights. The t-statistic is very high (about 31.5) and the coefficient is positive, which means that, controlling for all the other variables, listings with more available nights in the next 30 days tend to be more expensive.
The variable reviews_per_month does not seem to be a significant predictor, as its absolute t-value is less than 2. This makes sense, since the number of reviews per month is not necessarily related to the quality of a property and therefore to its price. A cheap rental could equally well have a high number of reviews per month as a medium-priced or more expensive rental. Therefore, this variable is removed from the final version of model 7, along with “beds” and “instant_bookable”, which also have absolute t-statistics below 2 and are therefore not statistically significant.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.3733604 0.0380309 141.289 < 2e-16 ***
number_of_reviews -0.0005111 0.0001092 -4.681 2.91e-06 ***
review_scores_rating 0.0380892 0.0074338 5.124 3.08e-07 ***
room_typeHotel room 0.4489061 0.0500189 8.975 < 2e-16 ***
room_typePrivate room -0.4757607 0.0113417 -41.948 < 2e-16 ***
room_typeShared room -1.0812471 0.0541773 -19.958 < 2e-16 ***
bedrooms 0.1898304 0.0114883 16.524 < 2e-16 ***
beds -0.0019668 0.0062329 -0.316 0.752
accommodates -0.1382611 0.0056616 -24.421 < 2e-16 ***
host_is_superhostTRUE 0.0978726 0.0139304 7.026 2.34e-12 ***
instant_bookableTRUE 0.0184907 0.0112407 1.645 0.100
reviews_per_month 0.0020915 0.0045276 0.462 0.644
availability_30 0.0237161 0.0007528 31.505 < 2e-16 ***
areasCity East -0.1639282 0.0128778 -12.729 < 2e-16 ***
areasCity North -0.0903394 0.0136028 -6.641 3.35e-11 ***
areasCity West -0.0996768 0.0178968 -5.570 2.65e-08 ***
Residual standard error: 0.4198 on 6777 degrees of freedom
Multiple R-squared: 0.3767, Adjusted R-squared: 0.3753
F-statistic: 273.1 on 15 and 6777 DF, p-value: < 2.2e-16
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.380e+00 3.782e-02 142.272 < 2e-16 ***
number_of_reviews -4.717e-04 9.252e-05 -5.098 3.53e-07 ***
review_scores_rating 3.775e-02 7.416e-03 5.090 3.68e-07 ***
room_typeHotel room 4.586e-01 4.955e-02 9.256 < 2e-16 ***
room_typePrivate room -4.759e-01 1.130e-02 -42.107 < 2e-16 ***
room_typeShared room -1.087e+00 5.335e-02 -20.379 < 2e-16 ***
bedrooms 1.886e-01 1.137e-02 16.591 < 2e-16 ***
accommodates -1.387e-01 4.459e-03 -31.113 < 2e-16 ***
host_is_superhostTRUE 9.917e-02 1.387e-02 7.149 9.67e-13 ***
availability_30 2.391e-02 7.356e-04 32.508 < 2e-16 ***
areasCity East -1.640e-01 1.287e-02 -12.743 < 2e-16 ***
areasCity North -9.022e-02 1.360e-02 -6.634 3.52e-11 ***
areasCity West -9.921e-02 1.787e-02 -5.553 2.92e-08 ***
Residual standard error: 0.4198 on 6780 degrees of freedom
Multiple R-squared: 0.3764, Adjusted R-squared: 0.3753
F-statistic: 341.1 on 12 and 6780 DF, p-value: < 2.2e-16
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.202240 1 1.096467
review_scores_rating 1.030361 1 1.015067
room_type 1.285817 3 1.042789
bedrooms 2.259004 1 1.502998
accommodates 2.478903 1 1.574453
host_is_superhost 1.191883 1 1.091734
availability_30 1.115858 1 1.056342
areas 1.028985 3 1.004773
As the summary statistics above indicate, our final model comprises statistically significant variables only and has an adjusted R-squared value of 0.375. This means that the model explains 37.5% of the variation in the log-transformed price. At first, this seems like a mediocre model, since almost 2/3 of the variation in price remains unexplained. However, given that rental prices depend heavily on the specific location (as opposed to the mere neighbourhood), the quality of the amenities, the date of the last renovation, and many other factors, we consider an R-squared of almost 40% satisfactory. For instance, the addition of the simplified neighbourhood variable only added c. 2 percentage points of explanatory power in our analysis. A future investigation should therefore put more emphasis on location and possibly consider factors such as “distance to public transport” or “distance to airport”.
In addition to the explanatory power, we should also evaluate our model using the RMSE. This analysis reveals whether the model actually works on unseen data or whether it is overfitted to the specifics of the training data. The analysis below suggests that model 7 generalizes well, based on two things. First, the rmse_train value is small (0.419 on the log scale), which means that predicted and actual values are fairly close. Second, the difference between rmse_train and rmse_test (about 0.001) is small, which means that the model applies not only to the training set but also has predictive power on new data.
[1] 0.4193595
[1] 0.420724
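The RMSE comparison can be sketched as follows. The data are simulated, and the helper `rmse` as well as the object names are hypothetical; only the idea of comparing the error on the training and testing data comes from the text.

```r
set.seed(7)
# Simulated data standing in for the training and testing sets
n <- 400
toy <- data.frame(x = rnorm(n))
toy$log_price <- 5 + 0.3 * toy$x + rnorm(n, sd = 0.4)
train <- toy[1:300, ]
test  <- toy[301:400, ]

fit <- lm(log_price ~ x, data = train)

# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse_train <- rmse(train$log_price, predict(fit, newdata = train))
rmse_test  <- rmse(test$log_price,  predict(fit, newdata = test))
print(c(train = rmse_train, test = rmse_test))
```

Because the response is on the log scale, an RMSE of about 0.42 describes a typical multiplicative error rather than an error in euros. A test RMSE that is much larger than the training RMSE would point to overfitting.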
In a future study, it would be interesting to apply the same model to other cities and test how it performs there. One can hypothesize that in different regions of the world, some variables have a particularly strong effect on price. For example, in regions that are less safe or more heterogeneous than Berlin, the neighbourhood variable may be of greater significance. In that case, the RMSE would reveal that the model must be adapted, because the accuracy on the new data would be a lot lower than on the training data.
To provide an overview of the models that we worked with, we can create a summary table of the important parameters. From this table, we can see that between model 2 and 3, as well as between model 6 and 7, we could increase the explanatory power significantly.
| variable | model1 | model2 | model3 | model4 | model5 | model6 | model7 |
|---|---|---|---|---|---|---|---|
| (Intercept) | 5.2527 | 5.2260 | 5.4156 | 5.4393 | 5.4171 | 5.4807 | 5.3804 |
| | (0.0547) | (0.0397) | (0.0405) | (0.0405) | (0.0406) | (0.0407) | (0.0378) |
| prop_type_simplifiedEntire rental unit | -0.1236 | | | | | | |
| | (0.0379) | | | | | | |
| prop_type_simplifiedEntire serviced apartment | 0.3699 | | | | | | |
| | (0.0587) | | | | | | |
| prop_type_simplifiedOther | -0.1397 | | | | | | |
| | (0.0407) | | | | | | |
| prop_type_simplifiedPrivate room in rental unit | -0.5145 | | | | | | |
| | (0.0382) | | | | | | |
| number_of_reviews | -0.000263 | -0.000212 | 0.000106 | -0.000150 | -0.000197 | -0.000271 | -0.000472 |
| | (0.000099) | (0.000097) | (0.000094) | (0.000100) | (0.000100) | (0.000099) | (0.000093) |
| review_scores_rating | 0.0441 | 0.0296 | 0.0247 | 0.0178 | 0.0199 | 0.0198 | 0.0377 |
| | (0.0085) | (0.0083) | (0.0080) | (0.0080) | (0.0080) | (0.0079) | (0.0074) |
| room_typeHotel room | | 0.7885 | 0.7603 | 0.7524 | 0.7134 | 0.7074 | 0.4586 |
| | | (0.0554) | (0.0531) | (0.0529) | (0.0534) | (0.0529) | (0.0495) |
| room_typePrivate room | | -0.3873 | -0.4753 | -0.4766 | -0.4778 | -0.4746 | -0.4759 |
| | | (0.0119) | (0.0123) | (0.0123) | (0.0122) | (0.0121) | (0.0113) |
| room_typeShared room | | -1.0191 | -0.8660 | -0.8435 | -0.8419 | -0.8705 | -1.0872 |
| | | (0.0594) | (0.0584) | (0.0583) | (0.0582) | (0.0576) | (0.0534) |
| bedrooms | | | 0.1699 | 0.1700 | 0.1717 | 0.1759 | 0.1886 |
| | | | (0.0125) | (0.0124) | (0.0124) | (0.0123) | (0.0114) |
| beds | | | 0.0059 | 0.0055 | 0.0054 | 0.0060 | |
| | | | (0.0068) | (0.0068) | (0.0067) | (0.0067) | |
| accommodates | | | -0.1202 | -0.1214 | -0.1235 | -0.1250 | -0.1387 |
| | | | (0.0061) | (0.0061) | (0.0061) | (0.0061) | (0.0045) |
| host_is_superhostTRUE | | | | 0.1072 | 0.1046 | 0.1058 | 0.0992 |
| | | | | (0.0151) | (0.0150) | (0.0149) | (0.0139) |
| instant_bookableTRUE | | | | | 0.0592 | 0.0600 | |
| | | | | | (0.0120) | (0.0119) | |
| areasCity East | | | | | | -0.1660 | -0.1640 |
| | | | | | | (0.0138) | (0.0129) |
| areasCity North | | | | | | -0.0893 | -0.0902 |
| | | | | | | (0.0146) | (0.0136) |
| areasCity West | | | | | | -0.0539 | -0.0992 |
| | | | | | | (0.0191) | (0.0179) |
| availability_30 | | | | | | | 0.0239 |
| | | | | | | | (0.0007) |
| #observations | 6793 | 6793 | 6793 | 6793 | 6793 | 6793 | 6793 |
| R squared | 0.1531 | 0.1908 | 0.2580 | 0.2635 | 0.2661 | 0.2820 | 0.3764 |
| Adj. R Squared | 0.1524 | 0.1902 | 0.2571 | 0.2625 | 0.2650 | 0.2806 | 0.3753 |
| Residual SE | 0.4890 | 0.4779 | 0.4578 | 0.4561 | 0.4553 | 0.4505 | 0.4198 |
We now apply the following criteria to find our target listings:
The review score value is higher than 90% of the full score
The number of reviews is larger than 10
It has a private room
The host’s identity is verified and the host is a superhost
It is in the neighbourhood of Friedrichshain-Kreuzberg
There are two beds
We believe the best model is model 7, as it has the highest adjusted R-squared and the lowest residual SE. We apply model 7 to predict the price, together with a lower and an upper bound for each prediction. Based on the output below, we can see that the predicted prices range from c. 120€ to 310€. However, the “lwr” and “upr” columns tell us that the spread around each estimated price is extremely high. This should not come as a surprise, since our model explains less than 40% of the variation in price. In the next chapter, we briefly discuss options for improving on this in a future analysis.
| fit | lwr | upr |
|---|---|---|
| 154.9289 | 67.94652 | 353.2627 |
| 133.4306 | 58.56466 | 304.0013 |
| 122.1938 | 53.62852 | 278.4216 |
| 144.5832 | 63.41973 | 329.6183 |
| 138.8953 | 60.96725 | 316.4308 |
| 244.3688 | 107.22133 | 556.9426 |
| 162.6726 | 71.40071 | 370.6178 |
| 129.3272 | 56.75620 | 294.6909 |
| 134.8729 | 59.20086 | 307.2710 |
| 160.6445 | 70.51440 | 365.9773 |
| 140.5367 | 61.68646 | 320.1767 |
| 139.7167 | 61.32606 | 318.3109 |
| 163.7481 | 71.86463 | 373.1105 |
| 309.3187 | 135.67159 | 705.2183 |
| 133.0535 | 58.39874 | 303.1439 |
| 162.3509 | 71.26135 | 369.8753 |
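The bounds in this table are symmetric around the fit in ratio terms (for the first row, 353.26 / 154.93 and 154.93 / 67.95 are both about 2.28). That pattern is what one would expect if the model were fitted on the logarithm of price and the bounds were exponentiated afterwards. Under that assumption, the width of the interval can be roughly reproduced from the residual SE of model 7 alone. The Python sketch below is an approximation, not the method used in the analysis: it ignores the uncertainty in the coefficients and uses the normal quantile 1.96, and the function name is hypothetical.

```python
import math


def approx_prediction_interval(fit_eur, residual_se, z=1.96):
    """Approximate 95% prediction interval for a model fitted on log(price).

    Assumptions: the model is linear in log(price), coefficient
    uncertainty is negligible, and z = 1.96 stands in for the t quantile.
    """
    log_fit = math.log(fit_eur)
    half_width = z * residual_se  # half-width of the interval on the log scale
    return math.exp(log_fit - half_width), math.exp(log_fit + half_width)


# First row of the table and the residual SE reported for model 7.
lwr, upr = approx_prediction_interval(154.9289, 0.419761353053853)
```

For the first listing this gives an interval of roughly 68€ to 353€, close to the tabulated bounds. The width is therefore driven almost entirely by the residual SE, so only a better-fitting model would narrow it.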
In this final section, we summarize the results of our selected model and discuss possible steps that could further improve the analysis.
As mentioned in the introduction, the overall goal of this analysis was to find a set of variables that would help us to predict Airbnb rental prices in Berlin. Our final model identifies the following variables as statistically significant drivers of these rental prices:
The coefficients for each of these variables tell us how rental prices are impacted. For instance, the positive 1.9 coefficient for “Nr. of bedrooms” tells us that rental prices tend to increase with the number of bedrooms. The standard error of each coefficient estimate gives us an idea of how far the estimate may be from the “true” value of the coefficient. Whenever the absolute ratio of coefficient to standard error is greater than 2, we can be relatively sure that the variable is in fact statistically significant (roughly at the 5% level). In our model, this is the case for all variables.
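The rule of thumb can be checked by hand. The Python sketch below applies it to two coefficient and standard error pairs copied from the regression table above; the dictionary layout and function name are only illustrative.

```python
# (coefficient, standard error) pairs copied from the regression table.
estimates = {
    "instant_bookableTRUE": (0.0599880595583609, 0.0118805223526592),
    "areasCity East": (-0.164015218469992, 0.0128713251550971),
}


def t_ratio(coefficient, standard_error):
    """Ratio of a coefficient estimate to its standard error."""
    return coefficient / standard_error


ratios = {name: t_ratio(coef, se) for name, (coef, se) in estimates.items()}

# Rule of thumb: an absolute ratio above 2 suggests statistical significance.
significant = {name: abs(ratio) > 2 for name, ratio in ratios.items()}
```

Both ratios are far above 2 in absolute value (about 5 and about 13), so both variables pass the check. The sign of the ratio only reflects the direction of the effect.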
If we look at the p-value of the overall model (c. 2.2e-16, i.e. 2.2 × 10^-16), we notice that it is extremely small. This means that we can reject, with near certainty, the hypothesis that none of our predictors is related to rental prices; it does not by itself tell us how well the model predicts. As previously stated, the adjusted R-squared (adjusted for the nr. of variables) tells us how much of the variation in price can be explained: at 37.8%, this is a meaningful share, but most of the variation remains unexplained.
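To make the adjustment concrete, the standard formula is adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p the number of estimated slope coefficients. The Python sketch below applies it to the R squared and #observations reported for model 7 in the regression table. The value p = 12 is an assumption: it is not stated in the text and was chosen because it reproduces the tabulated adjusted value.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# R squared and #observations for model 7, taken from the regression table.
# n_predictors=12 is an assumption inferred from the reported adjusted value.
adj_r2_model_7 = adjusted_r_squared(0.376420545786721, 6793, 12)
```

With close to 6,800 observations the penalty is tiny, which is why R squared and adjusted R squared differ only in the third decimal place for every model in the table.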
Our RMSE analysis also showed us that the model works well on different subgroups of the data. In a further analysis, it would be interesting to apply the model to other cities as well and compare how the explanatory power changes and whether any variable becomes an insignificant predictor of price.
Additionally, the completeness of the data set could be improved in a further analysis. We had to leave out thousands of rentals because they were missing values for our chosen predictor variables.
Finally, we recommend being aware of the impact of seasonality and weekday on prices. There are certainly some seasons in which demand for rentals is particularly high (e.g., on national holidays or during summertime). The same goes for certain days of the week (e.g., the weekend being in higher demand than weekdays). In a next analysis, we would therefore like to focus on the impact of these time-related variables on the variation in price.
The data for this project is from insideairbnb.com.